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Abstract 

Background: The inference of homologies among DNA sequences, that is, positions in multiple genomes that share 
a common evolutionary origin, is a crucial, yet difficult task facing biologists. Its computational counterpart is known 
as the multiple sequence alignment problem. There are various criteria and methods available to perform multiple 
sequence alignments, and among these, the minimization of the overall cost of the alignment on a phylogenetic tree 
is known in combinatorial optimization as the Tree Alignment Problem. This problem typically occurs as a 
subproblem of the Generalized Tree Alignment Problem, which looks for the tree with the lowest alignment cost 
among all possible trees. This is equivalent to the Maximum Parsimony problem when the input sequences are not 
aligned, that is, when phylogeny and alignments are simultaneously inferred. 

Results: For large data sets, a popular heuristic is Direct Optimization (DO). DO provides a good tradeoff between 
speed, scalability, and competitive scores, and is implemented in the computer program POY. All other (competitive) 
algorithms have greater time complexities compared to DO. Here, we introduce and present experiments a new 
algorithm Affine-DO to accommodate the indel (alignment gap) models commonly used in phylogenetic analysis of 
molecular sequence data. Affine-DO has the same time complexity as DO, but is correctly suited for the affine gap edit 
distance. We demonstrate its performance with more than 330,000 experimental tests. These experiments show that 
the solutions of Affine-DO are close to the lower bound inferred from a linear programming solution. Moreover, 
iterating over a solution produced using Affine-DO shows little improvement. 

Conclusions: Our results show that Affine-DO is likely producing near-optimal solutions, with approximations within 
1 0% for sequences with small divergence, and within 30% for random sequences, for which Affine-DO produced the 
worst solutions. The Affine-DO algorithm has the necessary scalability and optimality to be a significant improvement 
in the real-world phylogenetic analysis of sequence data. 
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Background 

The inference of homologies among DNA sequences, that 
is, positions in multiple genomes that share a common 
evolutionary origin, is a crucial, yet difficult task fac- 
ing biologists. Its computational counterpart is known 
as the multiple sequence alignment problem. There are 
various criteria and methods available to perform multi- 
ple sequence alignments (e.g. [1-9]). Among these, given 
a distance function, to minimize the overall cost of the 
alignment on a phylogenetic tree is known in combinato- 
rial optimization as the Tree Alignment Problem (TAP) 
[10-15]. The TAP typically occurs as a subproblem of 
the Generalized Tree Alignment Problem (GTAP) which 
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looks for the tree with the lowest alignment cost among all 
possible trees [10] . The GTAP is equivalent to the Max- 
imum Parsimony problem when the input sequences are 
not aligned, that is, when phylogeny and alignments are 
simultaneously inferred. 

An important element in sequence alignment and phy- 
logenetic inference is the selection of the edit function, 
and in particular, the cost G(k) of a sequence of k consec- 
utive insertions or deletions, generically called indels (e.g. 
an insertion of 3 consecutive T (k = 3) in the sequence 
AA could create the sequence ATTTA. The same opera- 
tion in the opposite direction would be a deletion. The 
sequence alignment implied would be A- - -A/ ATTTA, 
where - represents an indel). G(k) can have a signifi- 
cant impact in the overall analysis [16,17]. There are four 
plausible indel cost functions described in the literature: 
G(k) = bk (non-affine) [18], G(k) = a + bk (affine) 
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[18], G(k) = a + blogk (logarithmic) [16,19-22], and 
G(k) = a + bk + clog/c (affine-logarithmic) [16]. Sim- 
ulations and theoretical work have found evidence that 
affine-logarithmic yields the most satisfactory results, but 
provides marginal benefits over the afflne function, while 
its time complexity is much greater [16]. For this reason, 
many biologists adopt the afflne indel cost function. (This 
topic is still a subject of controversy.) 

For large data sets, a popular heuristic is Direct Opti- 
mization (DO) [15]. DO provides a good tradeoff between 
speed, scalability, and competitive scores, and is imple- 
mented in the computer program POY [23,24]. For 
example, the alignment for the Sankoff et al. data set 
[11] produced by DO has cost 302.25, matching that of 
GESTALT [25] and SALSA [26]. Using an approximate 
iterative version of DO that has the same time complex- 
ity, POY finds a solution of cost 298.75, close to the best 
known cost of PRODALI (295.25) [27]. All other (compet- 
itive) algorithms have greater time complexities compared 
to DO (e.g [25-27]). An important limitation of DO, how- 
ever, is that it was not defined for affine edit distance 
functions. 

The properties of DO and the GTAP (DO+GTAP) for 
phylogenetic analysis were experimentally evaluated in 
[28]. The main conclusion of that study is that DO+GTAP 
could lead to phylogenies and alignments less accurate 
than those of the traditional methods (e.g. CLUSTALW 
+ PAUP*). The initial work of Ogden and Rosenberg [28] 
raised a number of important questions: Do the conclu- 
sions hold if a better fit heuristic is used for the tree 
search in the GTAP? What would be the effect of using 
an afflne edit distance function? How do the hypothe- 
sis scores compare among the different methods? These 
questions have since been answered in various followup 
papers. 

In [29], the author found that the opposite conclusion 
can be drawn when a better fit heuristic for the GTAP is 
used. That is, when the resulting tree is closer to the opti- 
mal solution, DO+GTAP is a superior method. Moreover, 
a good fit heuristic is a fundamental aspect in phylogenetic 
analysis that cannot be overlooked. 

Although [28] performed simulations under afflne gap 
costs, the study used the non-afflne distance functions 
described for DO at the time of publication. Whether 
or not a different distance function could yield dif- 
ferent conclusions was tackled in [17]. The authors 
found that when using the GTAP as phylogenetic anal- 
ysis criterion under the afflne gap cost function, the 
resulting phylogenies are competitive with the most accu- 
rate method for simulated studies (i.e. Probcons using 
a ML analysis under RaxML) [17]. It is important to 
note that [17] used an early implementation of the algo- 
rithms presented in this paper (available in POY version 
4 beta). 



A comparison of the tree scores of various methods 
was recently performed in [30] and is implicit in some 
of the conclusions of [17]. The authors concluded that 
when using a heuristic fit for the GTAP, the hypotheses 
have scores better than those produced by other methods. 
Therefore, without hindsight (i.e., when accuracy can- 
not be measured), biologists would prefer the hypotheses 
generated under the GTAP. 

In this paper, we introduce and present experiments for 
a new algorithm Affine-DO. Affine-DO has the same time 
complexity of DO, but is correctly suited for the afflne gap 
edit distance. We show its performance experimentally, as 
implemented in POY version 4, with more than 330,000 
experimental tests. These experiments show that the solu- 
tions of Afhne-DO are close to the lower bound inferred 
from an Linear Programming (LP) solution. Moreover, 
iterating over a solution produced using Afflne-DO has 
very little impact in the overall solution, a positive sign of 
the algorithms performance. 

Although we build Afflne-DO on top of the successful 
aspects of DO, DO has never been formally described, nor 
have its basic properties been demonstrated. To describe 
Afflne-DO, we first formally define DO and demonstrate 
some of its properties. 

Related Work 

The TAP is known to be NP-Hard [31]. Due to its diffi- 
culty, a number of heuristic methods are applied to pro- 
duce reasonable (but most likely suboptimal) solutions. 
The first heuristic techniques [11,12] consist of iteratively 
improving the assignment of each interior vertex as a 
median between the sequences assigned to its three neigh- 
bors. This method can be applied to any initial assignment 
of sequences and adjust them to improve the overall tree 
cost. In recent work, Yue et al. [32] used this algorithm 
in their computer program MSAM for the tree alignment 
problem, using as initial assignment the median computed 
between the 3 closest leaves to the interior vertex (ties 
arbitrarily resolved). 

Hein [13,14], designed a second heuristic solution which 
is implemented in the TreeAlign program. In TreeAlign, 
sets of sequences are represented by alignment graphs, 
which hold all possible alignments between a pair of 
sequences. The complete assignment can be performed in 
a post-order traversal of a rooted tree, where each vertex is 
assigned an alignment graph of the two closest sequences 
in the alignment graph of its two children vertices. The 
final assignment can be performed during a backtrack 
on the tree. Although this method is powerful, it is not 
scalable (e.g. using TreeAlign to evaluate one of the simu- 
lations used in this study did not finish within 48 hours). 
Moreover, the TreeAlign program does not allow the user 
to fully specify the distance function. This algorithm was 
later improved by Schwikowski and Vingron, producing 
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the best tree alignment known for the Sankoff data set 
[33]. 

The most important theoretical results for the TAP are 
several 2 approximation algorithms, and a Polynomial 
Time Approximation Scheme (PTAS) [34-37]. These algo- 
rithms solve the TAP from a theoretical perspective, but 
the execution time of the PTAS renders it of no practical 
use. On the other hand, the 2-approximation algorithms 
have shown very poor performance when compared to 
heuristic methods such as that of TreeAlign. 

Direct Optimization (DO) [15] is a heuristic imple- 
mented in the computer program POY [23,24,38], which 
yields good execution times and competitive alignment 
costs. Given that DNA sequences have a small alphabet 
(4 elements extended with an indel to represent inser- 
tions and deletions), DO represents a large number of 
sequences in a compact way by using an extended alpha- 
bet (potentially exponential in the sequence length). In the 
spirit of the TreeAlign method, DO heuristically assigns 
to each vertex, during a post order traversal, a set of 
sequences in an edit path connecting two of the closest 
sequences assigned to the child vertices. Subsequently, in 
a pre-order backtrack, a unique sequence is assigned to 
the interior vertices to produce the solution. 

DO can be implemented with a time complexity of 
0(n 2 \ V|), where n is the length of the longest sequences, 
and V is the vertex set of the tree. For larger alphabets the 
total time complexity is 0(# 2 |V||E|), where |E| n is 
the alphabet. 

Schwikowski and Vignon [27] published the best heuris- 
tic algorithm for the TAP, as implemented in the PRO- 
DALI software. Although powerful, this algorithm has 
a potentially exponential time and memory complexity, 
which in turn makes it non-scalable and difficult to use for 
the GTAP. Moreover, PRODALI is not publicly available. It 
is for these reasons that DO is the algorithm of choice for 
the GTAP, yielding slightly weaker tree cost approxima- 
tions when compared to those of PRODALI, but suitable 
for better performance on larger data sets. 

Results and discussion 

Direct Optimization 

Direct Optimization (DO) has only been described infor- 
mally in the literature [15,38], and to build on it, we must 
first fill this gap. At the core of the algorithm is the use 
of an extended alphabet to represent sets of sequences. 
We begin by exploring the connection of this method with 
those using a tree alignment graph. 

In TreeAlign and PRODALI, the set of optimal align- 
ments between sequences are represented in an alignment 
graph. These graphs are aligned at each vertex in the tree 
to find their closest sequences. An alignment graph is then 
computed between these sequences, and assigned to a 
vertex of the tree. The alignment between a pair of such 



graphs, however, is an expensive computation, both algo- 
rithmically (0(n 4 )), and in its implementation. PRODALI 
is more expensive in practice, as it not only stores the set 
of optimal, but also suboptimal alignments. 

In DO, not all the possible alignments are stored, but 
only one. However, it stores all the possible sequences 
that can be produced from this alignment. We will call 
such set of sequences a reduced alignment graph (RAG). 
Thanks to their simplicity, DO use a more compact rep- 
resentation of a RAG, to permit greater scalability than 
that of TreeAlign or PRODALI. DO represents them as 
sequences in an extended alphabet by which we can then 
represent a complete RAG with an array. 

It is then possible to align RAGs, find the closest 
sequences contained in them, and compute their RAG 
with time complexity 0(n 2 ). The following section for- 
malizes these ideas. 

Sets of Sequences, Edition Distance, and Medians 

The first goal is to find a compact representation of sets of 
sequences produced in a pairwise alignment. For example, 
the alignment ATTG A- -C is represented in an alignment 
graph shown in Figure 1. Such graph can then be extended 
to include intermediate sequences such as ATG or ATC 
(Figure 1). 

The same information can be efficiently stored by using 
an extended alphabet Ep = ^(E) \ {0} that includes 
all the subsets of E with the exception the empty set, as 
follows 

{A}{T, indel}{T, indel}{G, C}. 

We call such representation a Reduced Alignment Graph 
(RAG). Notice that all the intermediate sequences can 
be produced by selecting an element from each set in 
the RAG, and removing all the indels from the resulting 
sequence. If a sequence can be generated by following this 
procedure, then we say that the sequence is included in 
the RAG. The example then includes the sequences ATTG, 
ATTC, ATC, ATG, AC, and AG. 

Observation 1. Let A e Tip be a RAG. Then there are 
YlxeA W sequences contained in A. 

In the original problem definition, we are given a dis- 
tance d between the elements in E. Let dp(A,B) = 
min ae A,beB d(a, b), be the distance between elements in 
Ep. The following observation is by definition: 

Observation 2. For allA,B e Ep, there exists an a e A 
and b e B such that dp(A, B) = d(a, b). 

Define the RAG edit distance by setting d = dp in 
Equation 1. 
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Figure 1 Alignment graphs. Graphs representing the alignment 
ATTG, A- -C. a. A plain alignment graph, b. An alignment graph 
that contains more potential sequences. 



The sequence edit distance can be computed using 
dynamic programming [39], following the recursive func- 
tion: 



e(Ai...i-i,B lmmmj -i) + d(Ai,Bj) 
e(Ai„j-i,Bi mm j) + d{Au indel) 
e(Ai...;,i>i...y_i) + d(Bj, indel) 

(i) 



with base cases ()) = 0, and e({),A) = e(A, (>) = 
Si<i<|A| d(Ai, indel). The afflne case is more elaborate 
but possesses the same spirit and time complexity [40] . 

We will show that we can find efficiently the closest 
sequences in a pair of RAGs, as well as their edit distance. 
Thanks to these properties, a RAG is used instead of an 
alignment graph, to bound the cost of a tree with lower 
time complexity. 

Lemma 1. For all RAGs A, B, there exists sequences U, 
V such that U is contained in A, V is contained in B, and 
e(A,B) = e(U, V). 



Proof. We define a procedure to produce U and V. Start 
with an empty U and V, and follow the backtrack of 
Equation 1. For each case, prepend the following to U 
and V: 

case 1 Select an element x e X{ that holds 
Observation 2 and prepend it to U. Then find an 
element y e Y# that is closest to x and prepend it to 
V. From Observation 2 we know that 
d(x,y) = dp(Xi, Yj). 

case 2 Select an element x e X{ closest to indel and 
prepend x to U and indel to V. Again from 
Observation 2 we know that 
d(x, indel) = dp{Xu {indel}). 
case 3 Symmetric to case 2. 

□ 

Observe that the overall time complexity remains 
as in the original Needleman-Wunsch algorithm [39] . 



The DO Algorithm 

DO(7", x). Direct Optimization 

DO (Algorithm 1) estimates the cost of a tree by proceed- 
ing in a post-order traversal on a rooted tree, starting at 
the root p, and assigning a RAG to each interior vertex. 

Data: A binary tree T with root p 

Data: An assignment x : L(T) -> X of sequences to 

the leaves L of T 

Data: S(v) holds a set of sequences for vertex v, 
initially empty for every vertex 

Result: cost holds an upper bound of the cost of T, x 
begin 

cost 0 

foreach level of T, with the bottom level first do 
foreach node v at the level do 

if v is a leaf (has no children) then 

S(v) +- (a h ai = {x(v)/}> 
else 

Data: v has children u and w 
cost cost + ep(S(u), S(w); 
U,W <— the alignment of S(u) and 
S(w)) respectively; 

S(v) <r- m P (U, W); 
end 
end 

return cost 
end 
end 

We have not defined yet mpiU, W) to compute each 
RAG. Let m(X, Y) m(X, Y) be the set of elements in X and 



Varon and Wheeler BMC Bioinformatics 201 2, 13:293 
http://www.biomedcentral.eom/1 471 -21 05/1 3/293 



Page 5 of 14 



Y that realize the distance dp(X, Y). Let the RAG between 
two aligned RAGs A,B e E*, \A\ = \B\ = n be 

m P (A,B) = (Xi = rn(A h Bi)). 

Without loss of generality, assume from now on that for 
all x G £ \ {indel}, d(x, indel) = b for some constant b. 

Lemma 2. Let C = mp(A,B). Then for all X included in 
C, there exists Y and Z included in A and B respectively, 
such that e P (A,B) = e(Y,Z) = e(X, Y) + e(X,Z). More- 
oven Y and Z are (some of) the closest sequences to C that 
are contained in A and B respectively. 

Proof Follows directly from the median definition and 
Lemma 1. □ 

Lemma 2 is important for the correctness of the DO 
algorithm. It shows that for every sequence contained in 
C, there are corresponding sequences in A and B of edit 
distance equal to ep(A,B). This lemma can then be used 
in the DO algorithm to delay the selection of a sequence 
from each RAG, and use ep directly to calculate the over- 
all cost of the tree. Without it, ep cannot be used for this 
purpose directly. 

Definition 1. Compatible assignments Two assignments 
X : V -> E* and x' : V —> E* are compatible if both 
assign the same sequences to corresponding leaves, that is, 
for all v e L, x(v) = x'(v). 

The following Theorem shows that the tree cost com- 
puted by DO is feasible: 

Theorem 1. There exists an assignment of sequences x' 
compatible with x such that 

DO(T, X )= e(x'(u), X '(v)). 

(u,v)eE 

Proof Let T have root vertex p. Call x' the final assign- 
ment of sequences to the vertices of T Select any X 
included in S(p) and set x'(p) <- X. Then for each other 
vertex v with parent p, following a pre-order traversal 
starting at p, let x(v) <- X where X e E* is included in 
S(v) and is closest to x (^)- From Lemma 2, we know that 
for any selection at p there exists a selection in its children 
that would yield the additional cost computed at p during 
the DO algorithm. Moreover, at each pre-order traversal 
step, we assign to each vertex v the closest sequence to 
x'ip) included in S(v). Again from Lemma 2, we know that 
the total cost of the two edges connecting p with its chil- 
dren must be greater than or equal to the additional cost 
computed for vertex p in the DO algorithm. Therefore, 

do(T, X ) > E ( „,v )6 £m e ^'(")> x'(v)). □ 



DO is weaker than the alignment graph algorithms 
[14,27,33], as the later techniques maintain the set opti- 
mal edit paths between sequences, or a superset including 
it. However, in these algorithms the overall execution 
time and memory consumption requirements could grow 
exponentially [27]. In contrast, DO maintains a polyno- 
mial memory and execution time, making it more scalable, 
with competitive tree scores. Moreover, DO can be effi- 
ciently implemented thanks to the simplicity of the data 
structures involved. 

The Affine Gap Cost Case 

In practice, biologists use DO because of its scalability 
and competitive costs. However, the DO algorithm was 
denned for the non-afflne distance functions (G(k) = bk), 
and does not work correctly for the popular affine indel 
cost model [18] (G(/c) = a + bk). Under many parameter 
sets, DO could produce worse tree cost estimations than 
those of the Lifted Assignment if used under the affine gap 
cost model (non published data). The fundamental reason 
for this problem is that Lemma 2 does not hold for the 
affine gap cost (e.g. Figure 2), and therefore, ep cannot be 
directly used to correctly bound the cost of a tree. 

To overcome this problem, we extend Gotohs algorithm 
[40] to compute distances heuristically for sequences in 
Ep, and define a new median sequence. With these tools, 
we modify DO so that Lemma 2 still holds to compute tree 
cost bounds. 

Heuristic Pairwise RAG Alignment 

Let A and B be a pair of RAGs to be aligned. Define 
the affine edit distance function, analogous to ep, using 4 
auxiliary matrices (g, d, v, and h), as 

eBff P (M...i,Bi...j) = min{g[ , d[ , v[ , h[ }. 

The matrices g,d,v } and h will be filled recursively. 
Before denning them formally, the basic intuition of the 
procedure is that g[ i,j] is the cost of an alignment where 



AAA TTT 

0 + 0/ | s^J + 5 

AAA TTT AA ACCCCCT TT AAACCCCCTTT 

7 + \ ^ 7+2 

AAACCCTTT 

Figure 2 Example of suboptimal median. Let G(k) = 7 + k. The 
center sequence is the median for the alignment of the left and right 
sequences. (The underscored C represents {C, indel}) Although the 
upper and lower sequences are included in the median, the lower 
one is not in an optimal edit path connecting left and right. This 
example shows Lemma 2 does not hold for affine gap costs. 
Therefore, there are sequences in this RAG that cannot be used 
directly in the DO algorithm without an extra cost, not computed by 
ep. It follows that DO, if used directly for the affine gap cost case, can 
compute an incorrect cost for a given tree. 
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Ai and Bj align elements other than an indel. d[ Uj] is the 
cost of an alignment using indel elements in A[ and Bj. 
v[ Uj] is the cost of an alignment where we use a "vertical" 
indel block by aligning Bj with an indel Finally, h[i,j] is 
the cost of an alignment where we use a "horizontal" indel 
block by aligning Ai with an indel 

To compute these values, we define a number of 
accessory functions. The cost of a pure substitution 
substiX, Y) = dp(X \ {indel}, Y \ {indel}). Symmetric to 
the substitution cost, we need the cost of extending a gap 
when indel e A,B c E: 



diagiX, Y) 



0 if indel e X&ndindel e Y 
oo otherwise. 



There are three remaining accessory functions required 
to compute the matrices g, h, v, and d. Each function han- 
dles various cases where a or b needs to be added. The first 
function, go(A, i) evaluates whether or not it is necessary 
to add a gap opening value when aligning Ai with a gap: 

0 if i = 1 and indel e Ai 
go (A, i) = 0 if i > 1 and indel £ A;_iand indel e Ai 
a otherwise. 

The second function go' {X, Y) calculates the extra cost 
incurred when not selecting an indel in one of the 
sequences means splitting an indel block: 



go\X, Y) = subst(X, Y) + 



0 if indel £ X 
a otherwise. 



The third, and final accessory function, computes what 
would be the extra cost of extending an indel, that is: 



geiX) = 



0 if indel e X 
b otherwise. 



Finally, the recursive functions for the cost matrices is 
denned as: 



h[i,j] = min 

v[i,j] — min 
d[ Uj] 



g[ i - hj - 1] +subst(Ai,Bj) 
d[ i - hj - 1] +subst(Ai,Bj)+ 

go(A,i)+go(B,j) 
v[i-l,j-l]+go'(Bj,Ai) 
h[i-l,j-l]+go'(A i9 Bj), 

h[i,;-l]+ge(Bj) 
d[i,j-l]+ge(Bj)+go(B,j), 

. \v[i-l,;]+ge(Ai) 
min { 

[d[i-l,f\+ge(Ai)+go(A,i), 

diag{A h Bi) + 

\d[i-l t j-l] 
min < 

\g[i- hj - 1] +go(A, i) + go(B,j), 



(2) 



(3) 

(4) 
(5) 
(6) 



with base cases^[ 0, 0] = 0, d[ 0, 0] = oo, v[ 0, 0] = go (A, 1), 
h[ 0, 0] = go(B, l),g =[ 0, /] = d[ 0, /] = v[ 0, /] = 
oo, A [0,/]= h[0,i - l]+ge(Bi),l < / < \B\,v[j,0] = 
v[j - 1,0] +ge(Aj), and^[;,0]= d[j,0]= h[j,0]= oo, 
l<j<\A\. 

The following theorem shows that if we align a pair 
of sequences in A, B, then we can bound the cost of the 
closest pair of sequences included in them. 

Theorem 2. There exists a sequence X contained in A 
and a sequence Y contained in B suck that e a ff p (A,B) > 
eajjiX, Y). 

Proof. We are going to create a pair of sequences X and 
Y contained in A and B respectively that have edit cost at 
most e a ff p (A,£). To do so, follow the backtrack that yields 
e dftp(A,B), and at each position i and j in the aligned A 
and B assign X^ and Y/ 0 where k is the alignment position 
corresponding to the aligned X/ and Yj as follows: 

1- g[ Uj] is the cost of aligning A and B\_j when a 
non-indel element of Ai and Bj is aligned. If the 
backtrack uses g[ Uj] then assign to Xi and Yj the 
closest elements in Ai \ indel and Bj \ indel. Observe 
that all the cases in Equation 2 align a non-indel 
element from Ai and Bj, and add a cost that is always 
greater than or equal to substiA^Bj) = d(Xi, Yj). 

2. h[ Uj] is the cost of extending an indel in the 

horizontal direction. Therefore, select X^ = indel, 
and 



indel if indel e Bj 
y,y e Bj otherwise. 



If Yk = indel, then the alignment of Xk and Yk causes 
no additional cost in the particular alignment being 
built between X and Y. Otherwise, then there is an 
extra cost, of at least the b parameter, which both 
cases of Equation 3 account for. Additionally, if the 
previous pair of aligned elements are a pair of indels 
(second case in 3, see below for the treatment of this 
option), then an extra indel opening cost is added. 
3. v[ /,/'] is the cost of extending an indel block in the 
vertical direction. The treatment is symmetric to that 
of h, with Yk = indel and 



X k = 



indel 



if indel e Ai 



x,x e Ai otherwise. 



4. d[ i,j] is the cost of extending an indel in the diagonal 
direction, that is, when both A and B hold indels, and 
those indels are being selected during the backtrack. 
Equation 6 ensures that this choice is only possible 
by assigning oo whenever at least one of Ai or Bj does 
not contain an indel Otherwise, if this option is 
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IZZZZZD 

debits = credits= 0 



debits = credits = 
a + bl 



Figure 3 Credits and debits in the simple cases, credits and debits incurred by the different possible arrangements of subsegments with 
matching limits in S(p), S(u), and S(v). The only cases with credits = debits > 0 (in the right box) represents with filled boxes the assignments that 
would yield an indel block. 



selected, then simply assign indel to both and 
with no extra cost for the alignment of X and Y. 

□ 

The Main Algorithm: Affine-DO 

We will now use e a ff p (A,B) to bound the cost of a tree 
using a post-order traversal, in the same way we did with 
DO (Algorithm 1). In order to do so a RAG to be assigned 
on each step must be defined (i.e. the function mp in Algo- 
rithm 1). To create the RAG M (initially empty), do as 
follows in each of the 4 items described in the proof of 
Theorem 2: 

1. If we selected two indels in X^ and Y/ a don't change 
M 

2. If Xj< = indel and Y^ ^ indel, then prepend 
{indel} U Bj to M 



3. If Xfc ytz indel and Y^ = indel, then prepend 
{indel} U A t to M 

4. If Xk indel and Y^ ^ indel, then prepend 

{x e Ai, for some y e Bj, d(x,y) = d(X/ a Y/ c )} + 
Bj, for some x e Ai, d(x,y) = d(X^ Yj<)} to M. 

5. Once the complete M is created, remove all the 
elements M{ = {indel} to create the indel-less RAG 
M We call M the RAG produced by m^ p {A,B). 

Definition 2. Affine-DO Affine-DO is Algorithm 1, mod- 
ified by replacing mp with m a ff P > an d e P w ^ n e aff P - 

It is now possible to use the Affine-DO algorithm to 
bound heuristically the cost of an instance of the TAR 

Theorem 3. Given a rooted tree T with root p, and an 
Affine-DO assignment S : V(T) -> E^, there exists an 



S(v)[ 




]S(u) 




Figure 4 Credits and debits in the complex cases. In the upper part, overlapping blocks of type B in S(u) and S(y), with a complex pattern of 
insertions and deletions in S(p). The total credits added at S(p) by Affine-DO can be transferred to u and v. In the lower, the credits transferred to v 
can be assigned to m individual insertion blocks (solid boxes), and one deletion block (dashed empty box) which maintain debits > credits. 
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Table 1 Simulation parameters 



Parameter 


Values Evaluated 


Substitution Rate 


1.5 


Average Branch Length 


0.05,0.1,0.2,0.3,00 


Max. Gap 


1,2,5,10,15 




70,100,150,200, 


Root Sequence Length 




300,400,500,1000 



All combinations of parameters were employed to generate the test data sets. 
The branch length variation equals the average branch length. 



assignment x' • V(T) —> X* such thatX — x'(p) an d the 
cost computed by Affine-DO equals that implied by x' . 

Proof. If there are no indels involved in the tree align- 
ment, then the arguments of Theorem 1 would suffice. 
Hence, we now concentrate on the cases that involve 
indels. 

To prove those remaining cases, we will use induction 
on the vertices of the tree. To do so, we will count the cred- 
its that each vertex adds to the subtree it roots as added 
by the Affine-DO algorithm. The credits represent the 
maximum total cost of the indels involved in a particular 
subtree; we will compare them with the debits incurred by 
a set of indels, and verify that the credits are always greater 
than or equal to the debits. To simplify the description, we 
will call type A subsequences of maximal size holding only 
indels, and type B subsequences of maximal size holding 
sets that include, but are not limited to, indels, and type 
C maximal subsequences holding sets with no indel. We 
will count without loss of generality the credits and deb- 
its within those subsequences. In Figures 3 and 4, Type A 
is represented as a line, type B as a box with a center line, 
and type C as an empty box. 

For the inductive step, consider the leaves of the tree. 
By definition, for all vgI, S(v) can contain subsequences 
of neither type A nor B, as there are no indels allowed. 
Therefore, the theorem holds true, with a credits = 
debits = 0. 

Consider now the interior vertex v, with children u and 
v. In Figure 3, all the simple cases where the limits of the 



subsequences in S(u) and S(v) match those of S(p). It is 
straightforward to see that in all those cases credits = 
debits. 

Consider now the more difficult case when the blocks 
do not have exact limits. Assume without loss of generality 
that S(u) and S(v) have a segment of type B, and S(p) has 
in the corresponding segment a series of blocks of type A 
and C (Figure 4). (There can be no subsequences of type 
B in S(p) aligned with those of type B in S(u) and S(v) as 
Waffp does not allow it.) 

The total credit granted by Equation 2 is c > 2ma + 
2bY^JLi We can transfer c/2 to u (v), so that in one 
edge rooted by u (v), a series of insertions correspond- 
ing to the subsequences si, 52, . . . , s m can occur (Figure 4), 
while the other branch supports a single deletion of length 
/ — YllLi s i (Figure 4 lower, upper dashed box). The total 
debit of these events now rooted in u would be 

m m 

aim + 1) + bJ2 s i + b V ~J2 Si>} <c/2 + * + M. (7) 

i=l i=l 

By the inductive hypothesis, the subtree rooted by u (v) 
has credits > debits, and from Equation 7 we also have 
that credits > debits in p, therefore the theorem holds, 
and the overall tree rooted by p has a sequence assign- 
ment of cost at most that computed by the Affine-DO 
algorithm. □ 

Theorem 4. TfE is small, then Affine-DO has time com- 
plexity 0(n 2 \V\) time, otherwise the time complexity is 
0(n 2 \V\\X\). 

Proof. If the alphabet is small, then m^ p and dp can be 
pre-computed in a lookup table for constant time com- 
parison of the sets. For large alphabets the maximum 
size of the sets contained in £p can be made constant. 
Otherwise, a binary tree representation of the sets would 
be necessary, adding a | £ | factor to the set comparison. 
Each heuristic alignment can be performed using dynamic 
programming, with time complexity 0{n 2 ) where n is 
the maximum sequence length (Ukkonens [41] algorithm 




Figure 5 Approximate DO. An iteration of the approximated iterative improvement. To improve x, Affine-DO is used to produce X],X2, andx3 in 
the three possible rooted trees with leaves u,v, and w. If the best assignment x\ yields better cost than the original x, then it is replaced, otherwise 
no change is made. 
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makes no obvious improvement as insertions and dele- 
tions could have cost 0 when aligning sequences in X^). 
Each alignment must be repeated for \V\ vertices dur- 
ing the post-order traversal, yielding the claimed time 
complexity. □ 



Experimental Evaluation 

In this section, we describe the methods used to gener- 
ate the instance problems, assess the solutions generated 
by each algorithm, and compare the algorithms. This 
allows the assessment of the performance of each algo- 
rithm, Amne-DO in greater detail, and an evaluation of 
Amne-DO using exact solutions for trees with only 3 
leaves. 



Data Sets 

To generate the instance problems, We simulated a num- 
ber of sequences using DAWG 1.1.1 [42] with insertions 
and deletions following a power law distribution. The 
simulations followed random binary trees of 50 leaves 
comprising all the combinations of the parameters listed 
in Table 1. These produced a total of 96,000 indepen- 
dent simulations. For each data set, we collected the true 
sequence assignment. This information allows the com- 
parison of the cost calculated by Afnne-DO with the cost 
implied by the true sequence of events. Our expectation 
was to produce costs lower than those using the true 
sequence assignment. 

Solution Assessment 

The sequences assigned by the simulation can be far from 
the optimal solution. To evaluate Amne-DO, we used two 
algorithms: the standard Fixed States algorithm, which is 
known to be a 2-approximation, and the cost calculated 
by the solution of an LP instance of the problem. A good 
heuristic solution should always be located between these 
two bounds. As a comparison measure for each solution, 
the ratio between the solution cost and the LP bound was 
computed. The closer the ratio to 1.0, the better is the 
solution. 

This form of evaluation has the main advantage (but 
also disadvantage), of being overly pessimistic. Most 
likely, the LP solution is unachievable, and therefore, the 
approximation ratio inferred for the solution produced 
by Affine-DO will most likely be an overestimate. To 
assess how over-negative the LP bound is, we produced 
2100 random sequences divided in triplets of lengths 
between 70 and 1000. For each triplet, the Affine-DO, 
the LP bound, and the exact solution were computed. 
These three solutions were compared to provide an 
experimental overview of the potential performance of 
our algorithm. We selected random sequences because 




Figure 6 Algorithm comparison. General patterns observed in the 
approximation ratio of the different algorithms. Simulation is the 
simulated data, ADO is Affine-DO, Approx. and Exact IADO are the 
approximated and the exact iterative Affine-DO algorithms 
respectively, initial and final MSAM are the initial and final estimations 
of the MSAM algorithm, a. substitutions = 1, a = 0, b = 1, branch 
length=0.05. b. substitutions = 4, a = 3, b = 1, branch length=0.05. 
c. substitutions = 4, a = 3, b = 1 , branch length=0.3. 
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preliminary experiments showed evidence that these pro- 
duce the most difficult instances for Affine-DO. 

Algorithms compared 

We implemented a number of algorithms to approximate 
the tree alignment problem. Our implementation can be 
divided in two groups: initial assignment, and iterative 
improvement. 

Initial Assignment includes the Fixed States (a stronger 
version of the Lifted Assignment [34,43]), Direct Opti- 
mization [15], and Affine-DO. Each of these algorithms 
starts with a function x and creates a x' compatible with 
X which is an instance solution. DO and Affine-DO have 
already been described. The Fixed States [43] is a simple 
algorithm were the interior vertices are optimally assigned 
one of the leaf sequences of the input tree, yielding a 
2-approximation solution [34]. 

Iterative Improvement modifies an existing x' by read- 
justing each interior vertex using its three neighbors. This 
procedure is repeated iteratively, until a (user provided) 
maximum number of iterations is reached, or no further 
tree cost improvements can be achieved. The adjustment 
itself can be done using an approximated or an exact three 
dimensional alignment, which we call the Approximate 
Iterative and Exact Iterative algorithms. Approximate Iter- 
ative (Figure 5), uses DO or Affine-DO (the selection 
depends on which kind of edit distance function is used) 
to solve the TAP on the three possible rooted trees formed 



by the three neighbors of the vertex used as leaves. The 
assignment yielding the best cost is selected as the new 
center. The exact three dimensional alignment has time 
complexity 0(n 3 ) [44]. Our implementation uses the low 
memory algorithms implemented by Powell [44], though 
they can be improved to 0(n 2 ) memory consumption 
[45]. 

We compared MS AM [32], Affine-DO, Approximate 
Iterative, Exact Iterative, and Fixed States, using a lower 
bound computed with an LP solution. We do not include 
DO in the comparisons because it could not solve this 
problem [46]. It is therefore impossible to compare it 
directly with our algorithm. GESTALT, SALSA, and PRO- 
DALI were unavailable, and so, could not be used in 
our comparative evaluation. TreeAlign did not produce 
a solution for the simulations within 48 hours of exe- 
cution time, and therefore, was not included in the 
comparisons. 

In total, more than 330,000 solutions were evaluated. 
We only present those results that show significant dif- 
ferences, and represent the overall patterns detected. The 
Exact Iterative algorithm was only evaluated for the short 
sequences (70 to 100 bases), due to the tremendous exe- 
cution time it requires. Fixed States followed by itera- 
tive improvement is not included because its execution 
time is prohibitive for this number of tests (POY version 
4 supports this type of analysis). Nevertheless, prelimi- 
nary analyses showed that this combination of algorithms 
produce results in between Fixed States and Affine-DO, 
but not competitive with Affine-DO. 



Table 2 Numerical comparison of a pair of parameter combinations that represents the variation observed between the 
different algorithms 



Subst. Gap Op. 


Branch Len. 


Algorithm 


Min. 


Median 


Max 






Simulated 


1.088 


1.218 


1.337 


1 0 


0.05 


Fixed States 


1.275 


1.534 


1.755 






ADO 


1.044 


1.148 


1.215 






ADO + Iter. 


1.044 


1.123 


1.202 






Simulated 


1.731 


2.022 


2.396 


1 0 


0.3 


Fixed States 


1.621 


1.725 


1.816 






ADO 


1.314 


1.398 


1.453 






ADO + Iter. 


1.300 


1.377 


1.393 






Simulated 


1.108 


1.262 


1.415 


4 3 


0.05 


Fixed States 


1.302 


1.557 


1.766 






ADO 


1.084 


1.208 


1.312 






ADO + Iter. 


1.067 


1.171 


1.283 






Simulated 


2.012 


2.284 


2.611 


4 3 


0.3 


Fixed States 


1.688 


1.795 


1.868 






ADO 


1.433 


1.500 


1.542 






ADO + Iter. 


1.388 


1.442 


1.453 



Each individual indel has cost 1 . 
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Branch Lengths! 
m 0.05 
□ 0.1 
IS 0.3 



1.0 1.2 1.4 1.6 1.8 

Guaranteed Approximation 




I 1 1 1 1 

1.0 1.2 1.4 1.6 1.8 

Guaranteed Approximation 




1.0 1.2 1.4 1.6 1.8 



Guaranteed Approximation 

Figure 7 Affine-DO vs. Theoretical LP bound. Guaranteed 
approximation ratio of Affine-DO compared with the theoretical LP 
bound, for different cost and sequence generation parameters, a. 
substitutions = 1 , a = 0, b = 1 . b. substitutions = 2, a = 1,b = 1 . c. 
substitutions = 4, a = ],b = 3. 



Algorithm Comparison 

The most important patterns observed between the eval- 
uated algorithms are presented in Figure 6 and Table 2. 
In general, Affine-DO yields a better approximation than 
Fixed States. According to the density histograms (data 
not shown), the expected approximation ratio of 1.1 
(versus 1.5 for Fixed States) in the best parameter com- 
bination, and 1.5 (versus 1.7) for the worst. Iterative 
improvement (both in exact and approximated forms) has 
a small overall impact in the approximation ratio (with a 
maximal decrease of 0.05 when compared with the solu- 
tion inferred by Affine-DO alone). In all cases, Affine-DO 
found better solutions than the simulations (Table 2). 

Although the combination of Affine-DO and Itera- 
tive improvement produces better solutions, its execution 
time is dramatically higher. In the current implementa- 
tion, running on a 3.0 Ghz, 64 bit Intel Xeon 5160 CPU 
with 32 GB of RAM, Affine-DO evaluates each tree in less 
than 1 second in the worst case, while Affine-DO + Iter- 
ative improvement may take more than 1 hour per tree. 
For this reason, Affine-DO is well suited for heuristics that 
require a very large number of tree evaluations such as 
the GTAP, where millions of trees are evaluated during a 
heuristic search. 

Approximation of Affine-DO 

Figure 7 shows the density histogram of the guaranteed 
approximation of the Affine-DO algorithm when com- 
pared with the LP theoretical solution for a representative 




1.30 1.35 1.40 1.45 1.50 1.55 1.60 



Guaranteed Approximation 

Figure 8 Affine-DO vs. Theoretical LP bound with random 
sequences. Guaranteed approximation of Affine-DO for random 
sequences. In the left substitutions=1 , a = 0, b = 2, in the center 
substitutions^ , a = 0, b = 1 , and in the right substitutions=2, o = 1 , 
6=1. These are representative of the distributions observed in the 
experiments. 
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1.00 1.05 1.10 1.15 1.20 1.25 1.30 

Approximation 




^ Estimated Approximation with LP 
^ True Approximation 



Approximation 



9 i C 




1.00 



1.05 



1.20 



1.25 



1.30 



1.10 1.15 

Approximation 

Figure 9 Affine-DO vs. exact solution. Tightness of the Affine-DO 
solution according to the LP bound compared to the exact 
approximation. Observe that even for a very small data set, the LP 
bound is not realistic, and Affine-DO is close to the optimal solution, 
a. substitutions = 1 , a = 0, b = 1 b. substitutions = 2, a = \,b = 1 . c. 
substitutions = 4, a = ],b = 3. 



set of parameters. The results show that Affine-DO has 
a guaranteed approximation of less than 60% in every 
case. 

Typically, the larger the sequence divergence, the larger 
is the approximation degree of Affine-DO. The same pat- 
tern is observed for larger a. To test an extreme case, were 
the branch length is maximal, we evaluated the behavior 
of random sequences in the same set of trees. Figure 8 
shows the results of this experiment. 

The worst case is observed with an average approxi- 
mation slightly over 1.5. This variation, however, could 
have been caused by a more relaxed LP bound, which 
could be producing an overly pessimistic evaluation of 
the algorithm. To assess the importance of this factor, we 
evaluated its tightness experimentally. 

Comparison with an exact solution 

To assess Affine-DO and the tightness of the LP bound, we 
computed the exact solution for 700 unrooted trees con- 
sisting of 3 leaves with random sequences assigned to their 
leaves, under all the parameter sets tested. Figure 9 shows 
the density histograms for the results obtained. 

Note that the LP-inferred bound is overly negative even 
for these small test data sets, with the inferred approxima- 
tion expected at around 1.15, while in reality Affine-DO 
finds solutions that are expected to approximate within 
1.05 of the optimal solution, a 10% difference for trees 
consisting of only 3 sequences. 

Conclusions 

We have presented a novel algorithm that we have 
called Affine-DO for the TAP under affine gap costs. 
Our experimental evaluation, the largest performed for 
this kind of problem, shows that Affine-DO performs 
better than Fixed States. However, we observed that 
the LP bound is too pessimistic, producing unfeasible 
solutions 10% worse, even for the smallest non-trivial 
tree consisting of 3 leaves. Based on these observa- 
tions, we believe that Affine-DO is producing near- 
optimal solutions, with approximations within 10% for 
sequences with small divergence, and within 30% for ran- 
dom sequences, for which Affine-DO produced the worst 
solutions. 

Affine-DO is well suited for the GTAP under affine 
sequence edit distances, and yields significantly better 
results when augmented with iterative methods. The main 
open question is whether or not there exists a guaranteed 
bound for DO or Affine-DO. Then, if the answer is pos- 
itive, whether or not it is possible to improve the PTAS 
using these ideas. Additionally, many of these ideas can be 
applied for true simultaneous tree and alignment estima- 
tion under other optimality criteria such as ML and MAP. 
Their use under these different optimality criteria remains 
to be explored. 
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